Contributor

@hanwen-cluster hanwen-cluster commented Sep 19, 2025

Description of changes

The new NCCL version includes performance improvements on Blackwell GPUs.

This upgrade improves NCCL performance across two p6-b200 instances by 15%.

Therefore, this PR also updates the baseline numbers.

See the commit descriptions for details.

Tests

  • The NCCL test passed on RHEL9 with better performance. NCCL tests on other OSes are still running.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The new NCCL version includes performance improvements on Blackwell GPUs.
See the NCCL release notes: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-28-3.html#rel_2-28-3

This upgrade improves NCCL performance across two p6-b200 instances by 15%.
@hanwen-cluster hanwen-cluster requested review from a team as code owners September 19, 2025 18:12
@hanwen-cluster hanwen-cluster changed the title from "[integ-tests] Upgrade NCCL versions" to "[integ-tests] Upgrade NCCL versions and increase baseline numbers" Sep 19, 2025
gmarciani previously approved these changes Sep 19, 2025
The baselines are set to 90% of the currently measured performance.
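To illustrate the 90% rule, here is a minimal sketch of how a baseline could be derived from a measurement and then checked. The function names, the `fraction` parameter, and the bandwidth values are hypothetical and chosen only for illustration; they are not taken from this PR or from the integration test code.

```python
def baseline_from_measurement(measured_bandwidth: float, fraction: float = 0.90) -> float:
    """Return a pass threshold as a fraction of the measured bandwidth."""
    if measured_bandwidth <= 0:
        raise ValueError("measured_bandwidth must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return measured_bandwidth * fraction


def meets_baseline(observed_bandwidth: float, baseline: float) -> bool:
    """A run passes when its observed bandwidth is at or above the baseline."""
    return observed_bandwidth >= baseline


# Hypothetical measurement in arbitrary bandwidth units, not a number from this PR.
measured = 400.0
baseline = baseline_from_measurement(measured)  # 90% of the measurement

passes_fast_run = meets_baseline(380.0, baseline)  # above the threshold
passes_slow_run = meets_baseline(350.0, baseline)  # below the threshold
```

Keeping the baseline below the measured value leaves headroom for run-to-run variance, so a test fails only on a real regression rather than on normal noise.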

codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-3.14@25ff751).

Additional details and impacted files
@@               Coverage Diff               @@
##             release-3.14    #7012   +/-   ##
===============================================
  Coverage                ?   90.18%           
===============================================
  Files                   ?      182           
  Lines                   ?    16472           
  Branches                ?        0           
===============================================
  Hits                    ?    14856           
  Misses                  ?     1616           
  Partials                ?        0           
Flag Coverage Δ
unittests 90.18% <ø> (?)

@hanwen-cluster hanwen-cluster enabled auto-merge (rebase) September 19, 2025 19:34
@hanwen-cluster hanwen-cluster added the skip-changelog-update label (disables the check that enforces changelog updates in PRs) Sep 22, 2025
@hanwen-cluster hanwen-cluster merged commit 4a4c878 into aws:release-3.14 Sep 22, 2025
40 of 46 checks passed